[None][feat] Support request-scoped capacity-only KV cache compaction #15697
Hudayday wants to merge 2 commits into
/bot run --disable-fail-fast
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py`:
- Around line 2497-2502: The no-free fallback in
kv_cache_manager_v2._KVCacheManagerV2 should not ignore the boolean result from
kv_cache.resize(None, req.max_beam_num_tokens - evicted). After the warning in
the fallback branch, check the return value just like the normal resize path and
treat a failed resize as fatal or otherwise handle it consistently so the
request state stays aligned with the live _KVCache state.
In `@tests/unittest/_torch/pyexecutor/test_kv_cache_v2_compression_reclaim.py`:
- Around line 77-83: The current test coverage only exercises completion with
evicted=0, so the combined evicted-and-completing path in the reclaim logic is
still untested. Add a test in test_kv_cache_v2_compression_reclaim.py using
_fake_manager, _run, and _req with evicted > 0 and a completing state such as
LlmRequestState.GENERATION_COMPLETE or CONTEXT_INIT, and assert that the request
still calls resize(None, max_beam - 1) while fork() is not called. This should
verify the guard in the kv cache reclaim behavior when both conditions are
present.
📒 Files selected for processing (3)
- `tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py`
- `tests/integration/test_lists/test-db/l0_a10.yml`
- `tests/unittest/_torch/pyexecutor/test_kv_cache_v2_compression_reclaim.py`
5cdbbf3 to 095f9ff
2f10293 to c11ca19
[by Codex] @lowsfer Could you review this PR? Thanks!
ef05e64 to b053331
Signed-off-by: Hudayday <32944717+Hudayday@users.noreply.github.com>
b053331 to 5ba3c17
Signed-off-by: Hudayday <32944717+Hudayday@users.noreply.github.com>
Description
KV-cache compression can physically compact a request's live KV into a
smaller dense prefix. During generation, the V2 runtime adapter normally
passes the request's monotonically growing logical token count as
`history_length`. That behavior is correct for ordinary full attention, but it
causes a compacted request to grow back toward its logical length and prevents
the reclaimed pages from remaining available to the KV pool.
The V2 core already supports the required operation: shrink `capacity` while
preserving the existing committed history with
`resize(capacity, history_length=None)`.
This PR adds only the request-scoped runtime-adapter plumbing needed by
KV-cache compression. It does not change the V2 core and does not add a
fork/rewind or public manager API.
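The difference between the two resize modes can be illustrated with a toy model. This is a minimal sketch, not the TensorRT-LLM implementation: `ToyKVCache` and `generation_resize` are hypothetical names, and the choice to size the ordinary path's capacity from the logical token count is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyKVCache:
    """Toy stand-in for a per-request KV cache (hypothetical, not the V2 core)."""

    capacity: int = 0        # physical token slots currently held
    history_length: int = 0  # committed history tracked by the cache

    def resize(self, capacity: int, history_length: Optional[int] = None) -> bool:
        # history_length=None means "keep the committed history as it is".
        if capacity < 0:
            return False
        self.capacity = capacity
        if history_length is not None:
            self.history_length = history_length
        return True


def generation_resize(cache: ToyKVCache, logical_tokens: int,
                      physical_tokens: int, capacity_only: bool) -> bool:
    """Pick the resize arguments for one generation step (illustrative only)."""
    if capacity_only:
        # Capacity-only: follow the compacted physical size, leave history alone.
        return cache.resize(physical_tokens, history_length=None)
    # Ordinary path (assumed): both values follow the logical token count.
    return cache.resize(logical_tokens, history_length=logical_tokens)
```

With a compacted request holding 40 physical tokens out of 101 logical ones, the capacity-only path keeps the cache at 40 slots, while the ordinary path in this toy grows it back to 101.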
Changes
- `py_kv_cache_generation_capacity_only=True`.
- `history_length=None`, preserving the core's current committed history while allowing physical capacity to shrink.
- `(target_capacity, published_capacity, event)` compaction marker.
- the current rewind: `target + (live_capacity - published_capacity) - rewind`.
- `resize()` succeeds so a failed resize can be retried.
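The capacity formula and the retry behavior above can be sketched as follows. This is an illustration under stated assumptions: `ToyRequest`, `compacted_capacity`, and `apply_marker` are hypothetical names, the `event` element of the marker is treated as opaque, and reading the last item as "the marker is cleared only once `resize()` succeeds" is an interpretation rather than something the description states in full.

```python
from typing import Any, Callable, Optional, Tuple

# (target_capacity, published_capacity, event); event is opaque in this sketch.
Marker = Tuple[int, int, Any]


def compacted_capacity(target: int, live_capacity: int,
                       published_capacity: int, rewind: int) -> int:
    """target + (live_capacity - published_capacity) - rewind, as in the PR text."""
    return target + (live_capacity - published_capacity) - rewind


class ToyRequest:
    """Hypothetical request object that only carries a compaction marker."""

    def __init__(self, marker: Optional[Marker]) -> None:
        self.marker = marker


def apply_marker(req: ToyRequest, live_capacity: int, rewind: int,
                 resize: Callable[[int], bool]) -> bool:
    """Try to apply the marker; keep it when resize fails so a later step can retry."""
    if req.marker is None:
        return False
    target, published, _event = req.marker
    new_capacity = compacted_capacity(target, live_capacity, published, rewind)
    if resize(new_capacity):
        req.marker = None  # cleared only after a successful resize
        return True
    return False  # marker stays in place for the retry
```

For example, a marker published at capacity 100 with target 40, applied when the live capacity has reached 103 and the rewind is 1, gives 40 + (103 - 100) - 1 = 42.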
Compatibility
Requests that do not explicitly opt in call `resize()` with the same capacity
and history arguments as before. The opt-in check is request-scoped and
fail-closed.
There is no public API change, no V2 core change, and no manager-level
compression state. The compression implementation owns publishing the
request marker and clearing its generation capacity-only flag at request
completion.
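One way to read "request-scoped and fail-closed" is that only an explicit `True` on the request enables the capacity-only path, and anything else (a missing attribute, `None`, a truthy non-boolean) falls back to the existing behavior. The helper name `is_capacity_only` below is hypothetical, and the strict identity check is an assumption about how such a guard might look, not the PR's actual code.

```python
from types import SimpleNamespace


def is_capacity_only(req: object) -> bool:
    """Return True only when the request explicitly sets the flag to True."""
    # A missing attribute or any value other than the literal True opts out.
    return getattr(req, "py_kv_cache_generation_capacity_only", False) is True


# Usage: a request without the flag keeps the pre-existing resize arguments.
plain = SimpleNamespace()
opted_in = SimpleNamespace(py_kv_cache_generation_capacity_only=True)
print(is_capacity_only(plain), is_capacity_only(opted_in))
```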
Validation
- `python3 -m compileall`: passed
- `git diff --check`: passed
- `ruff check`: passed
- `ruff format --check`: passed